Introduction

Welcome to our workshop on inferential modelling with wide data. We hope you enjoy the session.

This workshop will cover the problems associated with inferential modelling of high-dimensional, wide data, suggest approaches to overcome them, and provide hands-on training in the implementation of regularisation techniques, covariate selection stability and triangulation.

The following packages are required for these exercises.

library(Hmisc)
library(broom)
library(glmnet)
library(ncvreg)
library(bigstep)
library(rsample)
library(tidyverse)
library(stabiliser)

Simulating data

In order to appreciate the issues we will be discussing today, we have provided functions to simulate datasets for exploration.

The following function generates a dataset with “ncols” as the number of variables and “nrows” as the number of rows.

generate_uncor_variables <- function(ncols, nrows) {
  # replicate() returns an nrows x ncols matrix of independent N(0, 1) draws
  as.data.frame(replicate(ncols, rnorm(nrows, 0, 1))) %>%
    as_tibble()
}

A dataset with 197 rows and 130 variables can then be generated using this function as follows:

variables <- generate_uncor_variables(ncols = 130, nrows = 197)

This results in the following dataset being generated:

variables

We can also generate an outcome variable, in this case randomly generated in the same manner, but named “outcome”.

generate_uncor_outcome <- function(nrows) {
  # a single random N(0, 1) column, named "outcome"
  tibble(outcome = rnorm(nrows, 0, 1))
}

outcome <- generate_uncor_outcome(nrows = 197)

outcome

We can now bind together the uncorrelated, randomly generated variables, with the randomly generated outcome.

df_no_signal <- outcome %>%
  bind_cols(variables)

This results in a dataset of 197 rows, with a single outcome variable, which has no relationship to the 130 columns as shown below.

df_no_signal

Conventional approaches

The following function conducts univariable analysis to determine the association between a given variable and the outcome. A Pearson or Spearman rank correlation matrix is another option for this.

univariable_analysis <- function(data, variable) {
  data %>%
    lm(outcome ~ variable, .) %>%
    tidy() %>%
    filter(term != "(Intercept)")
}
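As an aside, the correlation-matrix option mentioned above could be sketched as follows. This is an illustration only, not part of the workshop pipeline; it uses Hmisc::rcorr() (Hmisc is loaded above), and correlation_screen() is a hypothetical helper we introduce here:

```r
library(Hmisc)
library(tidyverse)

# Illustration only: screening via a correlation matrix rather than
# per-variable lm() fits. rcorr() returns a matrix of correlation
# coefficients ($r) and a matrix of p-values ($P).
correlation_screen <- function(data) {
  cor_res <- Hmisc::rcorr(as.matrix(data), type = "pearson") # or type = "spearman"
  tibble(
    variable = rownames(cor_res$r),
    estimate = cor_res$r[, "outcome"],
    p.value  = cor_res$P[, "outcome"]
  ) %>%
    filter(variable != "outcome")
}
```

The same p-value filter used below could then be applied to this table instead.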

This function can then be applied using map_df() to each column of the dataset individually and return a dataframe.

univariable_outcomes <- map_df(df_no_signal, ~ univariable_analysis(data = df_no_signal, variable = .), .id = "variable")

A conventional approach would then filter at a given threshold (for example P<0.2).

univariable_outcomes_filtered <- univariable_outcomes %>%
  filter(p.value < 0.2)

This results in a table below of all of the variables that have a p-value of <0.2 to be carried forward into a multivariable model.

univariable_outcomes_filtered

A list of variables to be included is as follows:

variables_for_stepwise <- univariable_outcomes_filtered %>%
  pull(variable)

These variables would subsequently be offered into a stepwise selection process such as the following

stepwise_model <- function(data, variables) {
  data_selected <- data %>%
    select(all_of(variables))

  lm(outcome ~ ., data = data_selected) %>%
    step(., direction = "backward", trace = FALSE) %>%
    tidy() %>%
    filter(p.value < 0.05) %>%
    rename(variable = term)
}

prefiltration_results <- stepwise_model(data = df_no_signal, variables = variables_for_stepwise)

prefiltration_results

Data with a true signal

We will test a variety of models on this dataset. For future comparison, let’s set up a list in which to store model results.

model_results <- list()
generate_data_with_signal <- function(nrow, ncol, n_causal_vars, amplitude) {
  # Generate the variables from a multivariate normal distribution
  mu <- rep(0, ncol)
  rho <- 0.25
  sigma <- toeplitz(rho^(0:(ncol - 1))) # symmetric Toeplitz covariance matrix
  X <- matrix(rnorm(nrow * ncol), nrow) %*% chol(sigma) # Cholesky factor imposes the desired covariance

  # Generate the response from a linear model
  nonzero <- sample(ncol, n_causal_vars) # ids of the 'true' causal variables
  beta <- amplitude * (1:ncol %in% nonzero) / sqrt(nrow) # effect sizes, non-zero only for true variables
  outcome <- X %*% beta + rnorm(nrow) # outcome = true signal + random error

  ## Rename true variables so they can be identified later
  X_data <- as.data.frame(X)
  for (i in nonzero) {
    X_data <- X_data %>%
      rename_with(.cols = i, ~ paste("causal_", i, sep = ""))
  }

  as.data.frame(cbind(outcome, X_data))
}

We can now simulate a dataset df_signal with 300 rows and 300 variables, 8 of which have a relationship with the outcome.

We can also alter the signal strength of causal variables by changing the “amplitude” parameter.

df_signal <- generate_data_with_signal(nrow = 300, ncol = 300, n_causal_vars = 8, amplitude = 7)

“Cheat” model

As we have simulated this dataset, we can “cheat” and create the perfect model to check everything’s worked correctly.

df_signal %>%
  select(outcome, contains("causal_")) %>%
  lm(outcome~., data=.) %>%
  tidy()

Conventional stepwise approach

We can now repeat our prefiltration and stepwise selection approach as before.

univariable_outcomes <- map_df(df_signal, ~ univariable_analysis(data = df_signal, variable = .), .id = "variable")
univariable_outcomes_filtered <- univariable_outcomes %>%
  filter(p.value < 0.2)
variables_for_stepwise <- univariable_outcomes_filtered %>%
  pull(variable)
model_results$prefiltration <- stepwise_model(data = df_signal, variables = variables_for_stepwise)
model_results$prefiltration

Regularisation

Regularisation offers an alternative to stepwise selection. The following function fits a cross-validated LASSO (via glmnet) and extracts the non-zero coefficients at the value of lambda that minimises cross-validated error.

model_lasso <- function(data) {
  y_temp <- data %>%
    select("outcome") %>%
    as.matrix()

  x_temp <- data %>%
    select(-"outcome") %>%
    as.matrix()

  fit_lasso <- cv.glmnet(x = x_temp, y = y_temp, alpha = 1)

  coefs <- coef(fit_lasso, s = "lambda.min")

  data.frame(name = coefs@Dimnames[[1]][coefs@i + 1], coefficient = coefs@x) %>%
    rename(
      variable = name,
      estimate = coefficient
    ) %>%
    filter(variable != "(Intercept)") %>%
    select(variable, estimate)
}

model_results$lasso <- model_lasso(df_signal)

The MCP penalty (via cv.ncvreg() from the ncvreg package) can also be used.

model_mcp <- function(data) {
  y_temp <- data %>%
    select("outcome") %>%
    as.matrix()

  x_temp <- data %>%
    select(-"outcome")

  fit_mcp <- cv.ncvreg(X = x_temp, y = y_temp)

  fit_mcp %>%
    coef() %>%
    as_tibble(rownames = "variable") %>%
    rename(
      estimate = value
    ) %>%
    filter(
      variable != "(Intercept)",
      estimate != 0,
      !grepl("Xm[, -1]", variable, fixed = TRUE)
    ) %>%
    mutate(variable = str_remove_all(variable, "`"))
}

model_results$mcp <- model_mcp(df_signal)

mBIC-based selection (via the bigstep package) can also be used.

model_mbic <- function(data) {
  y_temp <- data %>%
    select("outcome") %>%
    as.matrix()

  x_temp <- data %>%
    select(-"outcome")

  bigstep_prepped <- bigstep::prepare_data(y_temp, x_temp, verbose = FALSE)

  bigstep_prepped %>%
    reduce_matrix(minpv = 0.01) %>%
    fast_forward(crit = mbic) %>%
    multi_backward(crit = mbic) %>%
    summary() %>%
    stats::coef() %>%
    as.data.frame() %>%
    rownames_to_column(., var = "variable") %>%
    mutate(variable = str_remove_all(variable, "`")) %>%
    filter(
      variable != "(Intercept)",
      !grepl("Xm[, -1]", variable, fixed = TRUE)
    ) %>%
    rename(estimate = Estimate) %>%
    select(variable, estimate)
}

model_results$mbic <- model_mbic(df_signal)

A comparison of the number of True/False positives is shown below:

calculate_tp_fp <- function(results) {
  results %>%
    mutate(causal = case_when(
      grepl("causal", variable) ~ "tp",
      !grepl("causal", variable) ~ "fp"
    )) %>%
    group_by(model) %>%
    summarise(
      tp = sum(causal == "tp", na.rm = TRUE),
      fp = sum(causal == "fp", na.rm = TRUE)
    ) %>%
    mutate("total_selected" = tp + fp)
}

conventional_results <- model_results %>%
  bind_rows(., .id = "model") %>%
  calculate_tp_fp()

Stability selection

Function for bootstrapping

boot_sample <- function(data, boot_reps) {
  rsample::bootstraps(data, boot_reps)
}

bootstrapped_datasets <- boot_sample(data = df_signal, boot_reps = 10)

Bootstrapped data is presented here as a table of 10 different nested tables.

bootstrapped_datasets

If we extract a single bootstrapped dataset and sort by the outcome, we can see that several rows have been resampled. Consequently, as the bootstrapped dataset is the same length as the original, several other rows are omitted completely.

bootstrapped_datasets$splits[[1]] %>%
  as_tibble() %>%
  arrange(outcome)

Model for bootstraps

We can apply our previous lasso function over each one of these bootstrapped resamples.

model_lasso_bootstrapped <- map_df(
  .x = bootstrapped_datasets$splits,
  .f = ~ as.data.frame(.x) %>% model_lasso(),
  .id = "bootstrap"
)

Permutation

To identify a null threshold, first we must permute the outcome.

Our original dataset looks like this:

df_signal

By permuting the outcome variable we sever all ties between the outcome and the explanatory variables. We might want to conduct this 5 times.

permuted_datasets <- rsample::permutations(data = df_signal, permute = outcome, times = 5)

A single dataset would look like this. Note the structure of the explanatory variables remains the same, but the outcome is randomly shuffled.

permuted_datasets$splits[[1]] %>%
  as_tibble()

We can then apply our bootstrap function to each one of these 5 permuted datasets. We might perform 10 bootstrap samples for each of the 5 permuted datasets. The model would then be applied to each dataset within the following table.

permuted_bootstrapped_datasets <- map_df(
  .x = permuted_datasets$splits,
  .f = ~ as.data.frame(.x) %>% boot_sample(boot_reps = 10),
  .id = "permutation"
)

permuted_bootstrapped_datasets

This code is relatively lengthy and is therefore deliberately omitted from the workshop; however, it is present within the stabiliser package and freely available.

Using the ecdf() function, the stability of each variable within each permutation can be calculated.

By taking quantile(., probs = 1) of these stabilities, the highest stability that might have occurred by chance within each permutation can be calculated.

The mean threshold across all permutations can then be calculated. This represents the “null threshold”; i.e., the mean highest stability a variable might achieve across all permuted datasets (where we know there should be no links between variables and outcome).

Variables in the true model (i.e., non-permuted data) that have a stability value in excess of this null threshold are highly likely to be truly correlated with the outcome.
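Although the full implementation is left to the stabiliser package, the threshold calculation itself can be sketched in a few lines. This is a toy illustration only, not the stabiliser code; perm_stabilities is a hypothetical dataframe (not created in this workshop) with one row per variable per permutation, recording the percentage of bootstraps in which that variable was selected:

```r
library(tidyverse)

# Toy sketch of the null-threshold idea (NOT the stabiliser implementation).
# `perm_stabilities` is assumed to have columns `permutation` and `stability`.
null_threshold <- function(perm_stabilities) {
  perm_stabilities %>%
    group_by(permutation) %>%
    # quantile(probs = 1): the highest stability reached by chance in this permutation
    summarise(max_stability = quantile(stability, probs = 1)) %>%
    # the mean of these per-permutation maxima is the null threshold
    summarise(threshold = mean(max_stability)) %>%
    pull(threshold)
}
```

Variables in the non-permuted data whose stability exceeds this value would then be flagged as stable.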

stabiliser approach

stab_output <- stabilise(outcome = "outcome", data = df_signal, models = c("mbic"), type = "linear")

stab_output$mbic$stability

All models

The stabiliser package allows multiple models to be run simultaneously. Just select the models you wish to run in the “models” argument.

MCP is omitted here for speed. To include it, just add it to the list of models using: models = c("mbic", "lasso", "mcp")

stab_output <- stabilise(outcome = "outcome", data = df_signal, models = c("mbic", "lasso"), type = "linear")

Results

Calculate the number of true and false positives selected through stability approaches, and rename columns to include “_stability”.

stability_results <- stab_output %>%
  map_df(., ~ .x$stability, .id = "model") %>%
  filter(stable == "*") %>%
  calculate_tp_fp(.) %>%
  rename_all(., ~ paste0(., "_stability"))

stability_results

Compare this with the non-stability approach

conventional_results %>%
  left_join(stability_results, by = c("model" = "model_stability"))

Triangulation

The stabiliser package allows the stability selection results from multiple models to be used synergistically; by leveraging the strengths of various models, a more robust method of variable selection is often achieved.

triangulated_stability <- triangulate(stab_output)

triangulated_stability
## $combi
## $combi$stability
## # A tibble: 301 x 4
##    variable   stability bootstrap_p stable
##    <chr>          <dbl>       <dbl> <chr> 
##  1 causal_178     100             0 *     
##  2 causal_257     100             0 *     
##  3 causal_98      100             0 *     
##  4 causal_234      99.5           0 *     
##  5 causal_236      99.5           0 *     
##  6 causal_199      98.5           0 *     
##  7 causal_237      98             0 *     
##  8 causal_282      96.5           0 *     
##  9 V266            46             0 <NA>  
## 10 V78             46             0 <NA>  
## # ... with 291 more rows
## 
## $combi$perm_thresh
## [1] 55
stab_plot(triangulated_stability)
## $combi

No signal datasets

We can now return to our original dataset that we simulated to have no signal.

Our conventional approach performed relatively poorly, selecting the following variables as being significantly associated with the outcome variable.

prefiltration_results

Using stabiliser, the following variables are selected from the dataset.

stab_output_no_signal <- stabilise(outcome = "outcome", data = df_no_signal, models = c("mbic", "lasso"), type = "linear")

triangulated_output_no_signal <- triangulate(stab_output_no_signal)

triangulated_output_no_signal$combi$stability %>%
  filter(stable == "*")

Conclusions

Thank you for attending this workshop. We hope you enjoyed the session, and have a good understanding of where some conventional modelling approaches might not be appropriate in wider datasets.

If you have any further questions after the workshop, please feel free to contact Martin Green () or Robert Hyde ().